A proficient cost reduction framework for de-duplication of records in data integration

Authors

  • Asif Sohail
  • Muhammad Murtaza Yousaf
Abstract

BACKGROUND: Record de-duplication is the process of identifying records that refer to the same entity. It plays a pivotal role in data mining applications that involve the integration of multiple data sources and data cleansing. The task is challenging because of its computational complexity and the variations in data representation across data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during de-duplication. Both require tuning a set of parameters, such as the choice of a particular blocking or windowing variant and the selection of an appropriate window size for each dataset.

METHODS: In this paper, we propose a framework that applies blocking and windowing techniques in succession, so that these parameters do not have to be tuned. We also evaluate the impact of different configurations on dirty and massively dirty datasets. The experiments are performed using Febrl (Freely Extensible Biomedical Record Linkage).

RESULTS: The proposed framework is comprehensively evaluated using a variety of quality and complexity measures, such as reduction ratio, precision, and recall. The framework is observed to significantly reduce the number of record comparisons.

CONCLUSIONS: The selection of the linkage key is a critical performance factor for record linkage.
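The general idea of applying blocking and then windowing in succession can be illustrated with a minimal Python sketch. This is not the authors' framework and it does not use Febrl. The function names, the toy records, the blocking key (first letter of the surname), the sorting key (given name), and the window size are all assumptions chosen for illustration. The sketch groups records into blocks, slides a fixed-size window over each sorted block to generate candidate pairs, and then computes the reduction ratio and the recall of the candidate set against a hand-labelled set of true matches.

```python
from collections import defaultdict


def block_then_window(records, block_key, sort_key, window=2):
    """Return candidate id pairs: blocking first, then a sliding window per block.

    Illustrative only; `block_key`, `sort_key` and `window` are hypothetical
    parameters, not the configuration used in the paper.
    """
    # Step 1 (blocking): group record ids by their blocking key value.
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[block_key(rec)].append(rid)

    # Step 2 (windowing): sort each block, then pair every record with the
    # next (window - 1) records in sorted order.
    pairs = set()
    for ids in blocks.values():
        ids.sort(key=lambda rid: sort_key(records[rid]))
        for i in range(len(ids)):
            for j in range(i + 1, min(i + window, len(ids))):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


def reduction_ratio(n_candidates, n_records):
    """1 - (candidate pairs / all possible pairs) for a single dataset."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - n_candidates / total_pairs


def recall(candidates, true_matches):
    """Fraction of true matching pairs that survive candidate generation."""
    return len(candidates & true_matches) / len(true_matches)


# Toy data (invented for this sketch): ids 1-3 and ids 4-5 are duplicates.
records = {
    1: {"surname": "smith", "given": "john"},
    2: {"surname": "smith", "given": "jon"},
    3: {"surname": "smyth", "given": "john"},
    4: {"surname": "jones", "given": "mary"},
    5: {"surname": "jones", "given": "marie"},
    6: {"surname": "brown", "given": "alan"},
}
true_matches = {(1, 2), (1, 3), (2, 3), (4, 5)}

candidates = block_then_window(
    records,
    block_key=lambda r: r["surname"][0],
    sort_key=lambda r: r["given"],
    window=2,
)
rr = reduction_ratio(len(candidates), len(records))
rec = recall(candidates, true_matches)
print(sorted(candidates), rr, rec)
```

On this toy input the 15 possible pairs shrink to 3 candidates, at the cost of missing one true match. That is the trade-off between reduction ratio and recall that the evaluation measures are meant to capture.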

Journal:
  • BMC Medical Informatics and Decision Making

Volume 16

Publication date: 2016